Skip to content

feat: allow configurable Cohere API URL in CohereReranker#11

Open
davidjrh wants to merge 1 commit intodanny-avila:mainfrom
davidjrh:main
Open

feat: allow configurable Cohere API URL in CohereReranker#11
davidjrh wants to merge 1 commit intodanny-avila:mainfrom
davidjrh:main

Conversation

@davidjrh
Copy link
Copy Markdown

This pull request makes a small but important change to the CohereReranker class to improve configurability of the Cohere API endpoint.

  • The URL used for the Cohere rerank API in CohereReranker can now be overridden by setting the COHERE_URL environment variable; if not set, it defaults to the original Cohere endpoint.

This allows to specify a different base URL for Cohere, such as the ones provided by the Azure AI Foundry.

danny-avila added a commit that referenced this pull request May 4, 2026
The public LocalExecutionConfig contract documents `maxSpawnedBytes: 0`
as "no cap", but the kill check `totalSpawnedBytes > hardKillBytes`
triggered on the very first byte when the cap was zero. Skip the kill
path entirely when `hardKillBytes <= 0` so configs that explicitly opt
out of the OOM backstop behave as documented.
danny-avila added a commit that referenced this pull request May 4, 2026
, #5, #7, #9, #10, #11)

Comprehensive-review audit findings, fixed in one pass:

#1 (attachments full-file MIME sniff): switch classifyAttachment to
   open() + 12-byte header read for sniffing; only readFile() the
   full buffer when we're about to embed. A 9 MB image now allocates
   12 bytes for sniffing instead of 9 MB.

#3 (fallback walker SKIP_DIRS too small): expanded the Node-fallback
   walker's skip set to cover dist/build/.next/.cache/__pycache__/
   .venv/venv/target/vendor/coverage/.tox/.mypy_cache and friends.
   ripgrep itself respects .gitignore so this only affects the
   fallback.

#5 (dynamic fs/promises import): replaced two `await import('fs/
   promises').then(...)` calls in CompileCheckTool with a static
   `import { readFile, stat } from 'fs/promises'` per AGENTS.md.

#7 (test-only resets in public API): added @internal JSDoc to all
   three `_reset*ForTests` functions so IDE autocomplete signals
   they aren't part of the SDK surface. (The leading underscore was
   convention-only; @internal is the standard tag.)

#9 (legacy fields missing @deprecated): added @deprecated tags to
   `LocalExecutionConfig.cwd` and `allowOutsideWorkspace` pointing
   to their workspace.* replacements.

#10 (dead lexicallyInside compute in workspace policy hook): dropped
   the unused variable. Realpath is the source of truth for both
   the symlink-escape and alternate-mount cases; the lexical check
   provides no information so we don't pay for it.

#11 (voided unused imports in pi comparison script): removed the
   `void readdir; void cp;` workaround and the dead imports they
   were silencing.

Bonus: also addressed the comprehensive review's #2 (bitwise OR) by
verifying it was a misread — the code uses `||`, not `|`. Manual
review's reading was wrong; no fix needed.

Larger findings addressed:

#4 (no editStrategies tests): added
   `src/tools/local/__tests__/editStrategies.test.ts` with 12 cases
   covering exact / line-trimmed / whitespace-normalized /
   indentation-flexible match boundaries, start/end-of-file matches,
   unicode content, multi-line needles spanning blank lines,
   ambiguous-match rejection, and applyEdit semantics.

#12 (no FileCheckpointer tests): added
   `src/tools/local/__tests__/FileCheckpointer.test.ts` with 6
   cases covering capture+rewind happy path, idempotent capture,
   absent-file rewind (deletion), multi-file rewind, oversize-file
   tracking-but-not-snapshotting, and rewind across a deleted
   parent directory.

Performance:

#8 (lineWindow split-whole-file): replaced `content.split('\\n')`
   in `lineWindow` with a newline-walk that's O(start + limit)
   instead of O(file). For a 10 MB file with `offset: 1, limit: 10`
   this drops from "split the whole file into millions of strings"
   to ~10 indexOf calls. Falls back to the simple split when no
   limit is supplied.

Deferred:

#6 (spill files never cleaned) needs a Run-lifecycle hook to run
   cleanup; tracked separately rather than fixed inline.
danny-avila added a commit that referenced this pull request May 5, 2026
* feat: add local execution engine

* feat(local-engine): hardening + ergonomics follow-ups

Adds the follow-up items flagged in the design comparison so the local
engine ships with a defensible default posture.

Security & correctness
- Bridge bearer token: per-Run 256-bit token; programmatic Python and
  bash callers send `x-librechat-bridge-token`; server uses
  `timingSafeEqual`. Stops cross-process listeners on the same loopback
  port from invoking SDK tools.
- Symlink-aware workspace path: new `resolveWorkspacePathSafe()` walks
  up to the nearest existing ancestor, realpaths it, and rejects
  containment escapes — covers `write_file` to a brand-new path
  without false positives on `/tmp -> /private/tmp` style ancestors.
- Read tool guards: stat-cap (default 10 MiB, configurable via
  `local.maxReadBytes`), NUL-byte binary detection on the first 8 KiB
  before reading the rest, returns descriptive stubs.
- Local engine warns once per process when it runs without
  `@anthropic-ai/sandbox-runtime` wrapping so the no-sandbox default
  stays loud in CI/server logs.

Validation
- New `bashAst` config: `'off' | 'auto' | 'strict'` heuristic
  categorical-hazard pass over the quote-stripped command
  (command substitution, zsh privileged builtins,
  `/proc/<pid>/environ` reads, `IFS=` injection, hex-escape
  obfuscation, `source $VAR`, plus `eval`/`exec` denied in strict).
  Forward-compatible name for a future tree-sitter-bash AST drop-in.

Ergonomics
- File checkpointing: `LocalFileCheckpointerImpl` snapshots pre-write
  contents for `write_file`/`edit_file` and rewinds them on demand.
  `createLocalCodingToolBundle` returns the checkpointer alongside the
  tools when `local.fileCheckpointing: true`.
- Pluggable spawn seam: `LocalExecutionConfig.spawn?: LocalSpawn` lets
  callers route every command through SSH/containers/remote runners
  without forking the engine.
- Ripgrep fallback: `grep_search` and `glob_search` now probe for `rg`
  once per process and fall back to a Node-side walker + JS regex when
  ripgrep isn't on PATH; the old hard requirement is gone.

Tests
- 20 new unit tests covering AST findings, sandbox-off warning latch,
  checkpointer round-trip and file deletion, binary-file refusal,
  oversize stub, symlink escape, ripgrep fallback, and bridge auth
  rejection. Existing 109 tests still pass.

* feat(toolnode): fire PreToolUse/PostToolUse/PostToolUseFailure on direct path

Before this change, in-process tools added to `directToolNames` (every
graphTool with a real implementation, including the new local-engine
tools) skipped the hook lifecycle entirely. Only schema-only tools
that round-tripped through the host event-dispatch path actually fired
PreToolUse/PostToolUse/PostToolUseFailure/PermissionDenied. That meant
`createToolPolicyHook`, the new HITL `'ask'` flow, and PostToolUse
output rewriting all silently no-op'd on direct tools — including
every `bash`/`code`/`edit_file`/`write_file` from the local engine.

This adds `runDirectToolWithLifecycleHooks(call, config, batchContext)`
which wraps `runTool` with the same hook semantics
`dispatchToolEvents` uses for the single-call case, and routes the
three direct-call sites in `run()` through it. Behavior:

- Fast path: when the registry has none of PreToolUse / PostToolUse /
  PostToolUseFailure registered for this run, falls through to
  `runTool` with zero overhead. No regression for callers without
  hooks.
- PreToolUse: pre-resolves args (matching the event path), fires the
  hook with resolved args.
  - `decision: 'deny'` → synthesizes a Blocked error ToolMessage and
    fires PermissionDenied (observational).
  - `decision: 'ask'` → fail-closed deny regardless of
    `humanInTheLoop.enabled`. The event path handles 'ask' via
    LangGraph's interrupt(), which requires restructuring the direct
    invoke path; until that lands, returning 'ask' on a direct tool
    blocks the call (strict improvement vs. ignoring the hook
    entirely). A one-time warning fires when HITL is enabled, so
    hosts notice the gap.
  - `updatedInput` → applied to the call before runTool; runTool's
    placeholder resolution is idempotent on already-resolved args.
- PostToolUse: when runTool returns a non-error ToolMessage, fires
  PostToolUse and applies `updatedOutput` (preserving id/name/status).
- PostToolUseFailure: when runTool returns a status='error'
  ToolMessage, fires the failure hook (observational); the original
  error continues to propagate.

PostToolBatch participation across direct + dispatched outcomes is
intentionally out of scope: the batch hook accumulates inside
dispatchToolEvents and is fired at the end of its scope, and merging
direct-call outcomes into that aggregation crosses the two paths'
result-shaping logic. Left as a follow-up.

8 new unit tests in directToolHooks.test.ts cover deny+PermissionDenied,
ask fail-closed, allow, updatedInput, updatedOutput, PostToolUseFailure,
no-hooks fast path, and a mixed batch where one direct call denies and
the sibling runs.

Total non-API test count: 1163 -> 1171 passing.

* chore(scripts): add live-API smoke scripts for the local engine

Four scripts that exercise the local execution engine end-to-end with
a real LLM provider (matching the convention in src/scripts/*). Each
spins up a temporary workspace, runs the graph, then dumps observable
state so the behavior under test is visible in the console.

- local_engine.ts        — baseline e2e: bash + write_file + read_file
                           + edit_file + list_directory in a single
                           multi-step task. Sets `bashAst: 'auto'` so
                           the categorical-hazard pass is exercised.
- local_engine_ptc.ts    — `run_tools_with_code` (Python) calling
                           write_file / read_file / edit_file through
                           the localhost bridge in a single call. The
                           bridge token auth (timingSafeEqual) is on
                           by default for every Run, so this also
                           exercises that path.
- local_engine_hooks.ts  — direct-path hook integration: registers a
                           `createToolPolicyHook` that denies
                           write_file / edit_file plus an explicit
                           PostToolUse that prefixes every successful
                           tool result with "[reviewed]". Asks the
                           model to write then read; the write must
                           be blocked, the read must come back tagged.
                           Also asserts blocked.txt does not land on
                           disk.
- local_engine_checkpointer.ts — builds the local coding tools via
                           `createLocalCodingToolBundle({ fileCheckpointing: true })`
                           so the host keeps a handle on the
                           checkpointer, lets the model edit two
                           seeded files, then calls `rewind()` and
                           verifies originals come back. Exits non-zero
                           if rewind didn't restore both files.

npm scripts added for each: `local`, `local:ptc`, `local:hooks`,
`local:checkpointer`. All four typecheck clean (`tsc --noEmit`).

* fix(graph): pass hookRegistry on the non-event-driven path

Live integration test (`npm run local:hooks`) surfaced that hooks
weren't actually firing on local-engine tools when used with a plain
`graphConfig: { type: 'standard' }` (no `agentContext.toolDefinitions`).
Reason: `Graph.ts` only forwarded `hookRegistry` and `humanInTheLoop`
to ToolNode in the event-driven branch — the legacy non-event-driven
branch dropped both, so the policy hook silently no-op'd and a
`write_file` that the script's `createToolPolicyHook` was supposed to
deny landed on disk.

Two changes:

1. `StandardGraph.createToolNode` now passes `hookRegistry` and
   `humanInTheLoop` in BOTH branches. Hooks are conceptually
   orthogonal to whether tools dispatch as events or invoke in
   process; only the event path was historically wired.
2. `ToolNode.run`'s non-event-driven `else` branch now invokes
   `runDirectToolWithLifecycleHooks` instead of `runTool` directly,
   so PreToolUse / PostToolUse / PostToolUseFailure / PermissionDenied
   fire the same way they do on the event-driven mixed-batch path.
   The helper still falls through to `runTool` on the no-hooks fast
   path, so callers without hooks pay zero overhead.

Verified by re-running `npm run local:hooks`:
  - PermissionDenied fired 1 time(s) for write_file with the policy
    reason
  - PostToolUse saw read_file (the allowed read got the [reviewed]
    tag)
  - blocked.txt landed on disk? false (expected: false)

Also re-ran the full non-API jest suite — 1171 passing, no regressions.

* feat(local-engine): coding-tool parity push (fuzzy edit, diff, BOM/EOL, overflow)

Live cross-comparison against claude-code, pi-mono, and opencode (sst)
surfaced four meaningful gaps in the foundation tools. This commit
closes them; LSP integration and apply_patch (multi-file) are larger
and stay as documented follow-ups.

Edit fuzzy matching
-------------------
Our edit_file required exact `oldText` match, so a single mistyped
space would force the LLM to re-read and retry. New
`editStrategies.ts` walks a four-stage chain — exact, line-trimmed,
whitespace-normalized, indentation-flexible — stopping at the first
strategy that locates EXACTLY ONE match. The matched on-disk slice is
then literally replaced with `newText`; we never modify newText. The
strategy that fired is surfaced in the tool result's `strategies`
metadata so callers can see when fuzzy fallback kicked in. Inspired
by opencode's nine-strategy chain; kept to the four highest-yield
strategies for a first cut, room for more (block-anchor + Levenshtein
etc.) as needed.

Unified-diff output on edit/write
---------------------------------
`edit_file` and `write_file` now embed a unified diff in the
ToolMessage so the model sees what actually changed instead of the
prior `Applied 1 edit(s)` blob. Uses npm `diff`'s
`createTwoFilesPatch` with 3 lines of context, capped at 4 KB so
large rewrites don't blow context.

BOM + line-ending preservation
------------------------------
New `textEncoding.ts`:
- `decodeFile` strips and remembers UTF-8 BOM and CRLF/LF newline.
- `encodeFile` reapplies them on write.

`edit_file` / `write_file` now read the file, decode to LF + remember,
operate, and re-encode on write. Verified by tests: editing a CRLF
file keeps CRLF; overwriting a BOM file keeps the BOM.

Bash output overflow → temp file
--------------------------------
`spawnLocalProcess` previously truncated overflowing output in place.
Now: when total bytes exceed 2× `maxOutputChars`, the FULL
stdout/stderr is written to a temp file (`/tmp/lc-local-output-*.txt`)
and the path is appended to the formatted output as
`full_output_path: …`. The model can then `bash tail -n N <path>` (or
`grep`) to inspect specifics it actually needs. Mirrors pi-mono's and
opencode's overflow handling.

New deps
--------
- `diff@9` and `@types/diff@7` — the canonical TS diff lib opencode
  uses too, MIT-licensed.

Tests
-----
Added unit tests for each new behavior — 5 new cases for fuzzy
matching, diff embedding, CRLF preservation, BOM preservation. Ran
the full non-API jest suite: 1176 passing (was 1171). Ran all four
live integration scripts (`local`, `local:hooks`, `local:checkpointer`,
`local:ptc`) end-to-end against the real Anthropic API; all four
green and the diff payload now shows up in the model-visible tool
results.

* feat(local-engine): inline image/PDF attachments in read_file

Closes the last "Close now" parity gap: read_file can now return image
and PDF files as inline `MessageContentComplex[]` content blocks so
vision-capable models actually see the bytes, instead of getting our
old "binary file" stub.

New module
----------
src/tools/local/attachments.ts

- Magic-byte sniff for the five formats we care about (PNG, JPEG,
  GIF, WebP, PDF). Inlined the signatures rather than pulling in
  `file-type` (ESM-only, awkward under ts-jest); LibreChat's
  api/server/utils/files.js takes the same approach but uses the
  package since it's a CJS-OK Node server.
- Returns a typed Attachment union that the read tool branches on:
  image -> image_url block, pdf -> image_url block (for Anthropic's
  PDF-as-data-url path), oversize -> stub, binary -> stub,
  text-or-unknown -> falls through to the existing line-window read.
- Configurable via two new fields on `LocalExecutionConfig`:
  - `attachReadAttachments`: 'off' (default) | 'images-only' |
    'images-and-pdf'. Defaults `'off'` to preserve current behavior;
    hosts opt in when their model is vision-capable.
  - `maxAttachmentBytes`: pre-encoding size cap (default 5 MiB) that
    bounds the post-base64 token cost.

Plumbing
--------
- read_file branches on attachment.kind. For an image, it returns
  `[ [textBlock, imageUrlBlock], { artifact } ]`. LangChain's
  Anthropic adapter (verified at
  node_modules/@langchain/anthropic/dist/utils/message_inputs.js)
  preserves image_url blocks inside `tool_result` content arrays;
  OpenAI Chat Completions accepts them on vision models too. The
  Responses API flattens to text on the wire (function_call_output
  is string-only) which is the safe degrade.
- The textBlock that goes alongside the image_url carries the
  filename/mime/byte count so non-vision models still get useful
  signal.

Tests
-----
4 new unit cases:
- default behavior: still returns the binary stub
- images-only: returns array content with `image_url` data URL +
  artifact { mime: 'image/png', attachment: 'image' }
- oversize: caps embedding past `maxAttachmentBytes` -> "Refusing to
  embed" stub
- text files unaffected when embedding is enabled

29 LocalExecutionTools tests pass; 1180 non-API tests overall.

Live verification
-----------------
New `npm run local:image` script:
1. Copies a real PNG (the macOS Certificate Assistant droppedImage)
   into a temp workspace.
2. Asks the agent to `read_file` it and describe what's in it.
3. Inspects the run's ToolMessage to confirm the content is
   `[textBlock, imageBlock]` with a `data:image/png;base64,...` URL.

Live run against real Anthropic Sonnet:
  ToolMessage(name=read_file, content=[text,image_url] url=data:image/png;base64,iVBORw0KGgo…)
  Image attachment landed in tool result: true ✔
  ASSISTANT: "The image shows a blank certificate template with a
  light blue border. The word 'Certificate' appears at the top in an
  elegant, cursive script font. There's a decorative flourish or
  underline beneath the text, and a gold/yellow seal or badge
  decoration in the lower right portion..."

That's the actual content of the source PNG — vision is wired
through end-to-end.

* feat(local-engine): post-edit syntax check + compile_check tool

The two cheap "tell-the-agent-when-it-broke-the-file" alternatives I
called out as the pragmatic substitute for full LSP integration. Both
are opt-in.

1. Post-edit syntax check
-------------------------
New `local.postEditSyntaxCheck: 'off' | 'auto' | 'strict'`
(default `'off'`).

After every successful `edit_file` / `write_file`, runs a fast
per-file syntax checker keyed on extension and appends any error to
the tool result so the model self-corrects on the next turn without
a separate read round-trip:

  - `.js` / `.mjs` / `.cjs` / `.jsx`  -> `node --check <file>`
  - `.py`  / `.pyw`                   -> `python3 -m py_compile`
  - `.json`                           -> `JSON.parse` (in process)
  - `.sh`  / `.bash`                  -> `bash -n <file>`

`'auto'` appends `[syntax-check warning ...]` to the tool result and
returns success. `'strict'` throws the error so the next turn must
react. Each checker probes once for tool availability and silently
skips if missing.

TypeScript per-file syntax check is intentionally not in this list:
per-file `tsc` is slow and only catches a small fraction of TS
issues without type information. Use `compile_check` (below) for that.

2. compile_check tool
---------------------
New tool auto-bound when `engine: 'local'` and `includeCodingTools`
is on. Lets the agent ask "did my change break anything?" without us
shipping a real LSP client.

Auto-detection (first hit wins):
  - tsconfig.json or package.json[typescript] -> `npx --no-install tsc --noEmit`
  - Cargo.toml                                -> `cargo check --message-format=short`
  - go.mod                                    -> `go vet ./...`
  - pyproject.toml/setup.py with mypy         -> `python3 -m mypy .`
  - pyproject.toml/setup.py without mypy      -> `compileall .`
  - none of the above                         -> tells the agent "no toolchain"

Honours `local.compileCheck.command` and `compile_check.command` arg
overrides; honours `local.compileCheck.timeoutMs` (default 120s).
Output is truncated through the standard local-engine output cap and
spills overflow to a temp file like the bash tool does. Returns
exit-code-aware `PASSED`/`FAILED` headline + body.

Tests
-----
8 new unit cases:
  - JS/JSON broken/valid round-trip
  - returns null for unknown extensions
  - write_file appends `[syntax-check warning ...]` in `auto`
  - write_file in `strict` throws
  - compile_check reports "no recognised project marker" in an empty workspace
  - compile_check honours an explicit command override and reports exit codes

37 LocalExecutionTools tests pass; 1188 non-API tests overall.

Live integration
----------------
New `npm run local:compile` script that:
  1. Seeds a tiny TS workspace (tsconfig.json + index.ts).
  2. Asks Anthropic Sonnet to write a broken JS file (post-edit
     syntax-check warns) -> fix it -> write a broken TS file with a
     real type error -> call compile_check (must FAIL with
     `error TS2322`) -> fix it -> call compile_check again (must
     PASS).

Live run, end-to-end:
  - `[syntax-check warning via node --check]` appended to write_file
  - First compile_check: FAILED via `npx --no-install tsc --noEmit`
    (exit=2): `broken.ts(1,14): error TS2322: Type 'string' is not
    assignable to type 'number'.`
  - Second compile_check (after edit_file fix): PASSED
  - Both assertions ✔.

Other live scripts (local, local:hooks, local:checkpointer,
local:ptc, local:image) still green.

* chore(scripts): side-by-side comparison harness vs pi-mono

`npm run compare:pi` runs four identical tasks against pi-mono's CLI
and our local engine in parallel temp workspaces using the same model
(claude-sonnet-4-5 by default). Captures: tool-call shape, wall time,
input/output/cache tokens, and verifies the workspace ended in the
expected state.

Tasks
-----
- T1 simple-edit       : single literal substitution in greet.py
- T2 fuzzy-edit        : edit a file with trailing whitespace + tabs;
                         exercises both agents' fuzzy-match recovery
- T3 syntax-error-fix  : pre-seeded broken JS; ours has post-edit
                         syntax check, pi has to discover via bash
- T4 type-error-fix    : pre-seeded TS file with a real type error in
                         a tiny tsconfig project; ours has
                         compile_check, pi has bash + tsc

Pi runner spawns `node $PI_BIN --print --mode json --no-session
--provider anthropic --model <model>` and parses its JSON event
stream. Ours uses Run.create with `engine: 'local'` and walks the
final conversation. PI_BIN defaults to
~/Projects/pi-mono/packages/coding-agent/dist/cli.js.

Live results (real Anthropic, Sonnet 4.5)
-----------------------------------------
  task                    metric      pi     ours
  T1 simple-edit          verify      ✔      ✔
                          wall        8.4s   11.0s
                          tool calls  2      2
                          output tok  313    281
  T2 fuzzy-edit           verify      ✔      ✔
                          wall        16.1s  7.2s
                          tool calls  2      2
                          output tok  384    246
  T4 type-error-fix-loop  verify      ✔      ✔
                          wall        23.5s  13.6s
                          tool calls  6      4
                          output tok  805    397
  T3 syntax-error-fix     verify      ✔      ✔
                          wall        16.2s  11.7s
                          tool calls  3      4    (note: ours used 1
                                                  bash, pi used 2)
                          output tok  371    503

Both agents 4/4 verify ✔. Ours wins wall-time on 3/4 tasks, output
tokens on every task, and tool-call count on T4 (compile_check
short-circuits the bash+read+bash loop pi did).

Open follow-up surfaced by the harness: pi reports thousands of
cacheRead tokens per turn (Anthropic prompt cache rebate), ours
reports zero. We're paying full rate for the system+tools prefix on
every turn. Not a regression introduced here, but worth its own PR
to wire prompt caching on the Anthropic adapter.

* chore(scripts): compare:pi — caching enabled, +2 tasks, +variance

Updates to the side-by-side harness based on first-run findings.

1. Anthropic prompt caching wired on our side
-----------------------------------------------
The first comparison reported ours' cache_read=0 — pre-existing PR
clarification: caching IS supported (`addCacheControl` + agent
context's `hasAnthropicPromptCache()`), but it's gated on
`clientOptions.promptCache === true`. The harness's first cut set
that on `graphConfig.clientOptions`, but `Run.createLegacyGraph`
(src/run.ts:156) destructures `llmConfig` and rebuilds
`agentConfig.clientOptions` from it — `graphConfig.clientOptions`
is silently dropped on the standard path. So in this harness we
set `promptCache: true` on the llmConfig itself.

After the fix, ours reports real cache_read / cache_creation per
turn. Caching works.

2. New tasks
------------
- T5 multi-file-rename : rename `calc_total` → `calculateTotal` across
  src/lib.ts, src/index.ts, src/index.test.ts. Tests how each agent
  finds + applies the rename.
- T6 image-read-and-describe (ours-only): copies a real PNG into the
  workspace (macOS Certificate Assistant icon) and asks the model to
  describe it. Exercises our `attachReadAttachments: 'images-only'`
  end-to-end. pi has no equivalent and is skipped via the new
  `skip: 'pi'` field.

3. Variance support
-------------------
COMPARE_ITERS=N runs each task N times per side and reports the
mean. Skipped tasks count cleanly (no false N/A entries). The
verify check now force-fails when a runner errored, so a soft
verify (e.g. T6's "file-still-on-disk" check) can't mask a real
provider rejection.

4. Cost computation
-------------------
We now compute our cost from the same per-turn breakdown using
Sonnet 4.5 pricing ($3/$15 per Mtok input/output, $3.75 cache write,
$0.30 cache read), so the cost columns are directly comparable.

Live results (N=2)
------------------
Functional: pi 10/10 ✔, ours 12/12 ✔ (T6 ours-only adds 2).
Wall time: ours wins all 5 shared tasks (~30% faster on average).
Tool calls: ours fewer or equal on every shared task.
Output tokens: ours fewer on every shared task.
Cost: ours is ~6× more expensive per task even with caching on.

Why the cost gap survives caching
---------------------------------
Per-turn "input new" tokens:
  pi (T1): 35      ours (T1): 28298
  pi (T4): 76      ours (T4): 48630

Pi only sends the new turn payload (~35 tokens); ~the entire system
prefix and tool defs are cache-hits. Ours sends ~28k tokens of new
input per turn even with caching on, which strongly suggests our
cache prefix isn't covering tool definitions, OR the breakpoint
position isn't matching across turns. Worth a dedicated PR to tune
the cache breakpoints in `addCacheControl` / `AgentContext.systemRunnable`
so tool defs sit inside the cached prefix; this will close the cost
gap considerably.

The harness made this gap impossible to miss — the entire reason for
the cross-comparison.

* feat(cache): tool-prefix cache breakpoint + harness accounting fix

Two related changes; second one was the bigger win.

1. Tool-prefix cache breakpoint (Anthropic)
-------------------------------------------
New `partitionAndMarkAnthropicToolCache` helper at
src/messages/anthropicToolCache.ts. When `clientOptions.promptCache
=== true` and the provider is Anthropic, we now:

  - Stable-partition the bound tool list into [static, deferred] using
    `defer_loading` from `agentContext.toolDefinitions`.
  - Stamp `cache_control: { type: 'ephemeral' }` on the LAST static
    tool via the `extras` field LangChain's Anthropic adapter
    forwards (`AnthropicToolExtrasSchema`).

This way, the cached prefix covers exactly the static tool inventory.
Discovered deferred tools that arrive across turns sit *after* the
breakpoint and don't invalidate the prefix.

The wrap is non-mutating (clones with `Object.create(getProto…)` so
the original tool reference is untouched) and idempotent. 10 unit
tests in `src/messages/__tests__/anthropicToolCache.test.ts` cover
partition order, stamp behaviour, prototype preservation, extras
merging, and idempotency.

Wired in `Graph.ts` right after `resolveLocalToolsForBinding`, gated
on Anthropic + `promptCache: true`.

In practice the impact is smaller than I expected because
`addCacheControl` on the last user message ALREADY creates a cache
breakpoint that covers the tools+system prefix as part of the
cumulative cached prefix. The tool-level breakpoint becomes load-
bearing in cases where:
  - A turn arrives without a fresh user message (resume after tool
    call only),
  - The user message would otherwise blow the cache breakpoint, or
  - Cache budget needs to be split across multiple breakpoints to fit
    inside Anthropic's 4-breakpoint cap.

Still correct; still worth shipping; just not a 6× win on its own.

2. Harness accounting fix (the actual ~6× swing)
------------------------------------------------
`compare:pi`'s previous run reported ours' input tokens as ~28k–48k
per turn vs pi's ~35–80, suggesting a 6× cost gap that survived
caching. That number was wrong.

LangChain's Anthropic adapter at
src/llm/anthropic/utils/message_outputs.ts:31 reports
`usage_metadata.input_tokens` as the SUM of uncached input +
cache_creation + cache_read — i.e., the total prompt size, NOT the
uncached portion. Pi reports `usage.input` as uncached only. So our
"input new" column was double-counting cached tokens.

The harness now subtracts `cache_read + cache_creation` from
`input_tokens` when computing our "input new", so the column is
apples-to-apples with pi's `input` field.

Recomputed live results (N=2, caching on, real Anthropic):

  task                pi cost     ours cost   delta
  T1 simple-edit      $0.0158     $0.0171     +8%
  T2 fuzzy-edit       $0.0157     $0.0173     +10%
  T3 syntax-fix       $0.0210     $0.0248     +18%
  T4 type-error-loop  $0.0325     $0.0292     -10%   ← we win
  T5 multi-file       $0.0264     $0.0312     +18%
  T6 image (ours-only) —          $0.0158

Functional: pi 10/10 ✔, ours 12/12 ✔.
Wall time: ours wins T3/T4/T5 (most-complex tasks), pi wins T1/T2.
Tool calls: ours wins or ties every shared task.
Output tokens: ours wins every shared task.
Cost: pi ~10–20% cheaper on simpler tasks, ours wins the most
complex one (T4).

We're effectively at parity. The "we're 6× more expensive" claim
posted earlier was the harness error.

* feat(local-engine): workspace + exec seams + HITL on direct path

Three changes that fit together — first two answer the "what about a
second engine" + "what's the workspace" questions; the third was the
last gap blocking workspace-policy `'ask'` from working end-to-end.

1. WorkspaceFS exec seam (engine reusability)
---------------------------------------------
New `WorkspaceFS` interface in src/tools/local/workspaceFS.ts (read,
write, stat, readdir, mkdir, realpath, unlink, open) with a
`nodeWorkspaceFS` default. Every file-touching factory in the local
coding suite (read_file, write_file, edit_file, grep_search,
glob_search, list_directory, file checkpointer, binary detection
helper) now routes through this interface.

A future engine — stateful remote sandbox, in-memory test FS, ssh
jail — supplies its own `WorkspaceFS` and inherits every tool's
behavior (fuzzy-match edit, BOM/EOL preservation, syntax check,
checkpointer, attachment classifier) for free. Same story for
process spawning: the existing `local.spawn` is now nested under
`local.exec.spawn` with the legacy top-level field still honoured.

The seams are wired through three new resolvers:
  - `getWorkspaceRoots`  — canonical root + additionalRoots
  - `getReadRoots`/`getWriteRoots` — split allow-outside knobs
  - `getWorkspaceFS`     — `local.exec.fs` ?? nodeWorkspaceFS
  - `getSpawn`           — `local.exec.spawn` ?? local.spawn ?? Node

Plus 7 unit tests in src/tools/__tests__/workspaceSeam.test.ts that
verify additionalRoots, the new allowReadOutside/allowWriteOutside
split, legacy `cwd` + `allowOutsideWorkspace` back-compat, and that
a custom WorkspaceFS actually receives the file tool calls.

2. First-class workspace boundary + policy hook
-----------------------------------------------
Nested `local.workspace` config:
  workspace: {
    root: string,
    additionalRoots?: readonly string[],   // monorepo: extend boundary
    allowReadOutside?: boolean,            // split from write
    allowWriteOutside?: boolean,
  }
Back-compat: top-level `cwd` and `allowOutsideWorkspace` still work
and map to `workspace.root` / both `allowOutside*` flags.

`resolveWorkspacePathSafe(path, config, intent)` now takes a `'read'`
| `'write'` intent so reads and writes can have different allow-
outside semantics. Default callers pass `'write'` for stricter
clamping; read tools (read_file, grep_search, glob_search,
list_directory) explicitly pass `'read'`.

New `createWorkspacePolicyHook` in src/hooks/createWorkspacePolicyHook.ts
returns a `PreToolUse` callback that:
  - inspects each tool call's input via per-tool path extractors
    (defaults cover the local-engine coding suite; hosts can
    override / extend),
  - returns `allow` for in-workspace paths (incl. additionalRoots),
  - returns `'ask'` (default) / `'allow'` / `'deny'` for outside
    paths, with separate read/write policies.

This is exactly the user-visible "ask once / always" semantic
claude-code and opencode have, but built on the existing
PreToolUse + HITL machinery instead of a parallel permissions API.
14 unit tests in src/hooks/__tests__/createWorkspacePolicyHook.test.ts.

3. HITL interrupt() on the direct path (the missing piece)
----------------------------------------------------------
The previous direct-path hook commit fail-closed `'ask'` decisions
because the interrupt machinery only existed in the event-dispatch
path. The workspace policy hook above wants to use `'ask'` for
real, so this commit lifts the gap.

`runDirectToolWithLifecycleHooks` now:
  - When `humanInTheLoop.enabled === true` and PreToolUse returns
    `'ask'`: builds a single-tool `tool_approval` payload and raises
    a real LangGraph `interrupt()` (anchored to the node's
    RunnableConfig the same way `dispatchToolEvents` does — ToolNode
    disables LangSmith tracing so the AsyncLocalStorage frame must
    be re-established).
  - On resume:
      - `approve` runs the tool with the (possibly hook-rewritten) args
      - `reject` blocks via the new `blockDirectCall` helper
      - `respond` returns the host-supplied `responseText` as a
        synthetic success ToolMessage (no tool execution)
      - `edit` re-runs the tool with the host-edited args
  - When HITL is off: collapses to a fail-closed deny (matches the
    rest of the SDK's HITL-disabled default). One-time warning
    logged so hosts notice the gap.
  - `allowedDecisions` enforcement matches the event path — a
    decision type outside the policy's allowlist is fail-closed.

`blockDirectCall` is the shared helper that synthesises the Blocked
ToolMessage and fires `PermissionDenied`, used by every deny path
(deny decision, fail-closed ask, reject, allowedDecisions violation,
malformed respond).

Updated the existing fail-closed-on-ask test in
directToolHooks.test.ts to assert the new HITL-disabled behaviour
explicitly.

Live verification
-----------------
New `npm run local:workspace` script:
  - workspace.root = temp dir
  - additionalRoots = sibling temp dir
  - createWorkspacePolicyHook with outsideRead/outsideWrite='ask'
  - humanInTheLoop.enabled = true
  - asks the model to read 3 files: in-workspace, sibling, outside

Live result against real Anthropic Sonnet:
  ✔ inside.txt → workspace policy 'allow' → read INSIDE
  ✔ sibling.txt → 'allow' (additionalRoots) → read SIBLING
  ✔ outside.txt → 'ask' → INTERRUPT raised with action_request
                          carrying the path → script auto-approves
                          → tool runs → read SECRET
  Final: HITL interrupts handled = 1, successful tool messages = 3

The model didn't even need its "retry if blocked" instruction — HITL
handled the approval transparently.

Functional + non-API tests: 1245 passing (was 1224; +21 new across
workspace seam + workspace policy hook + the updated direct-tool
ask test).

* feat(common): canonical tool-name constants for local coding tools

LibreChat's `getToolIconType` (client/src/components/Chat/Messages/
Content/ToolOutput/ToolIcon.tsx) special-renders four of our tools
by exact name match: `bash_tool`, `read_file`, `execute_code`,
`run_tools_with_code`. We already use those names, so they pick up
the right icon end-to-end.

The newer local-engine tools (`write_file`, `edit_file`,
`grep_search`, `glob_search`, `list_directory`, `compile_check`)
were defined as per-file `Local*ToolName` consts — string literals
that would silently drift if anyone renamed them in only one place,
and that no consumer UI could discover programmatically.

Promotes them to `Constants.*` (single source of truth) and adds
two grouped lists alongside the workspace types so consumers can
match against canonical strings:

  - `Constants.WRITE_FILE`     ('write_file')
  - `Constants.EDIT_FILE`      ('edit_file')
  - `Constants.GREP_SEARCH`    ('grep_search')
  - `Constants.GLOB_SEARCH`    ('glob_search')
  - `Constants.LIST_DIRECTORY` ('list_directory')
  - `Constants.COMPILE_CHECK`  ('compile_check')
  - `LOCAL_CODING_TOOL_NAMES`  — the 7 local-only tools (the 6 above
                                 plus READ_FILE)
  - `LOCAL_CODING_BUNDLE_NAMES` — the local-only set + the bash/code/
                                  PTC pair the bundle wraps around
                                  the existing factories

The per-file `Local*ToolName` and `CompileCheckToolName` exports are
retained as back-compat aliases pointing at the canonical Constants
so existing consumers keep working.

Wired through:

- `createWorkspacePolicyHook` default path extractors now key off
  `Constants.*` (was string literals).
- `LocalCodingTools.ts` Local*ToolName aliases collapsed into
  `Constants.*` references.
- `CompileCheckTool.ts` likewise.

5 new unit tests in localToolNames.test.ts pin:
  - per-file aliases match Constants
  - canonical strings match the wire-level expectations LibreChat
    (and other consumer UIs) special-case
  - LOCAL_CODING_BUNDLE_NAMES matches what `createLocalCodingTools`
    actually emits
  - LOCAL_CODING_TOOL_NAMES matches what
    `createLocalCodingToolDefinitions` emits
  - bundle list is a strict superset of the local-only list

Net effect: when LibreChat (or any other consumer UI) eventually
adds icons for write_file / edit_file / grep_search / glob_search /
list_directory / compile_check, the wiring picks up automatically
without an SDK change. And renames from one side without the other
get caught at test time.

Tests: 1250 passing (was 1245; +5 new).

* fix(local): three Codex P2 review nits — bash args, rg cache, root-relative extras

Three real issues Codex flagged on the PR. Each is small but
behavior-affecting; tests pin each.

#1 — `executeLocalCode` dropped `args` when lang was bash
--------------------------------------------------------
For every other runtime (py, js, ts, php, go, rs, c, cpp, java, r,
d, f90), `getRuntimeCommand` appends `input.args` as positional
parameters. The `lang === 'bash'` short-circuit was calling
`executeLocalBash(input.code, config)` and silently discarding
`args`, so `{lang:'bash', code:'echo $1', args:['hi']}` printed an
empty string instead of `hi`.

Fix: when lang is bash AND args is non-empty, route through a new
`executeLocalBashWithArgs` helper that uses the standard
`bash -lc <code> -- <args...>` form so `$1`, `$2`, … resolve. Same
AST validation as the no-args path. The plain `executeLocalBash`
(used by the `bash_tool` factory itself) is unchanged.

Test: `codex review fixes › executeLocalCode bash args › passes
input.args as positional shell parameters when lang is bash`.

#2 — ripgrep availability cache was process-global
--------------------------------------------------
With per-Run `local.exec.spawn` backends (the seam added for the
future remote-engine), a single `let cachedRgAvailable` would let
Run A's "rg works" verdict poison Run B whose backend (e.g. a
remote sandbox without rg) doesn't have it. Run B would skip the
probe, try to spawn rg, throw, and never reach the Node fallback.

Fix: cache is now a `WeakMap<LocalSpawn, Promise<boolean>>` keyed
on the effective backend (whatever `getSpawn(config)` returns).
WeakMap lets disposed backends GC their entry; the test reset hook
re-creates the map.

Test: `codex review fixes › ripgrep cache backend scope › does not
bleed an "rg available" verdict from one backend to another`.

#3 — `additionalRoots` resolved against process.cwd, not workspace root
----------------------------------------------------------------------
With `workspace: { root: '/repo/app', additionalRoots: ['../shared'] }`,
the intended boundary is `/repo/shared`. The previous
`path.resolve(extra)` anchored to the process cwd, producing
`${process.cwd()}/../shared` — completely different on a server
with a different cwd, and could deny the legitimate sibling or
allow an unrelated directory.

Fix: `getWorkspaceRoots` and `createWorkspacePolicyHook` now both
do `isAbsolute(extra) ? resolve(extra) : resolve(root, extra)`.
Behaviour matches what users would expect from a "monorepo
sibling" config.

Test: `codex review fixes › additionalRoots resolved against
workspace root › treats relative additionalRoots as siblings of
root, not of process.cwd`.

Tests: 1254 passing (was 1250; +4 new). Lint + tsc clean.

* docs(hitl): unstale the "event-driven only" caveats; pin resume scope

Three JSDoc blocks were honest at PR #134 ship time but went stale
once `6122460` (hooks fire on the legacy non-event-driven path) and
`caed7a2` (HITL interrupt() lifted into the direct path) landed:

  - `HumanInTheLoopConfig` — "Scope: event-driven tools only" header
    plus the bullet declaring direct-path tools "bypass the hook
    system entirely. PreToolUse hooks do not fire for those tools and
    HITL approval does not gate them." Both halves are now false.
  - `ToolNodeOptions.hookRegistry` — "Only fires for event-driven
    tool calls; tools routed through directToolNames bypass hook
    dispatch entirely." Now false.
  - `ToolNodeOptions.humanInTheLoop` — "the interrupt path is only
    wired into the event-driven dispatch (dispatchToolEvents), not
    into directToolNames execution — direct tools bypass HITL
    entirely." Now false.

Also re-evaluated the "mixed direct + event batches re-execute the
direct half on resume" caveat. New tests in
`directToolHITLResumeScope.test.ts` pin the actual scope:

  - When a single tool interrupts via the direct path, its body
    runs **once** total. The first pass interrupted before reaching
    runTool; the resume pass ran the body after the host's decision
    was applied. The PreToolUse hook fires twice (the documented
    idempotency caveat).
  - When a sibling tool already ran in the same batch before the
    interrupting tool, the sibling's body runs **twice** — once on
    each pass. This applies symmetrically to event-dispatched and
    direct siblings; LangGraph rolls back the entire ToolNode
    invocation, not just the direct half.

Updated wording reflects both: HITL spans both paths uniformly, and
the side-effect warning applies to "any tool with side effects in a
batch where another tool interrupts" rather than singling out direct
tools. Hosts that want a tool to skip HITL must omit it from any
registered matcher rather than relying on the path it takes.

Tests: 1256 passing (was 1254; +2 new for resume-scope pinning).

* fix(local): two more Codex review nits — output OOM cap + bash_tool args

#4 P1 — `spawnLocalProcess` could OOM the host on noisy commands
----------------------------------------------------------------
Previous implementation accumulated every stdout/stderr chunk into
in-memory strings and only truncated post-`close`. A noisy command
(`yes`, `cat /dev/urandom | base64`, a verbose build) could stream
gigabytes before the cap kicked in — production-stability risk
since local tools execute arbitrary shell.

Fix: stream-time cap with three layers, all in `spawnLocalProcess`:

  1. Per-stream in-memory buffer capped at `maxOutputChars * 2`.
     Past that, chunks divert to a lazy temp file (the spill).
  2. The spill file is seeded with whatever was already buffered so
     it holds the FULL output (head from memory + tail from stream),
     not just the post-cap remainder.
  3. New `local.maxSpawnedBytes` config (default 50 MiB) hard-kills
     the process tree once total stdout+stderr crosses the cap.
     Independent from `maxOutputChars` (which only affects what the
     model sees) — this is the OOM backstop.

`finish()` now waits for the spill stream to flush before resolving,
so the model never sees `full_output_path: …` for a still-being-
written file.

Tests pinned: `codex review fixes (round 2) › streaming output cap`:
  - `yes` gets killed in ~200ms when capped at 64 KiB (vs the 30s
    timeout that would otherwise apply)
  - 200 KiB output with an 8 KB inline cap successfully spills; the
    file holds more than the in-memory truncation
  - small outputs do NOT create a spill file (no perf regression)

#5 P2 — `bash_tool` factory broke positional shell parameters
-------------------------------------------------------------
Same shape as the previous `executeLocalCode` bash-args fix
(`9b5fb85`), but in the `bash_tool` factory itself. It was
appending args literally to the command string:

    `${command} ${args.map(shellQuote).join(' ')}`

So `command: 'echo "$1"'` with `args: ['hi']` ran
`echo "$1" hi` — `$1` was empty, `hi` got appended literally.
Tool schema advertised args as execution arguments; the actual
behavior didn't match.

Fix: route through the `executeLocalBashWithArgs` helper (now
exported from LocalExecutionEngine) which uses
`bash -lc <command> -- <args...>` so `$1`, `$2`, … resolve
correctly. Falls back to the no-args `executeLocalBash` when args
is empty (no behavior change for that case).

Test: `bash_tool args › populates positional shell parameters from
input.args` (and a second case for missing args).

Tests: 1261 passing (was 1256; +5 new).

* fix(local): three more Codex nits — config.shell, probe cache scope, grep -e

#6 P1 — validateBashCommand hard-coded DEFAULT_SHELL
----------------------------------------------------
The `bash -n -c <command>` syntax preflight was hard-coded to
DEFAULT_SHELL ('bash' on POSIX, 'bash.exe' on Windows), so a Run
configured with `local.shell: '/bin/sh'` (or pointing at zsh, dash,
or any host where bash isn't available) would get its commands
syntax-validated against a different binary than the one that would
actually execute them. Valid commands could be rejected before
execution; commands relying on bashisms could be falsely accepted
on hosts without bash.

Fix: route the syntax preflight through `config.shell ?? DEFAULT_SHELL`
so validation uses the same binary the actual execution path uses.

Test: `codex review fixes (round 3) › validateBashCommand honours
configured shell › routes the -n preflight through local.shell when
set`. Intercepts the spawn calls and asserts the syntax check used
`/bin/sh`, not the DEFAULT_SHELL fallback.

#7 P2 — syntax-check probe cache was process-global
---------------------------------------------------
Same shape as the ripgrep-cache fix in `9b5fb85`: per-Run
`local.exec.spawn` backends mean a "node available" / "python
available" / "bash available" verdict from one backend's probe
could poison subsequent Runs whose backend (e.g. a remote sandbox
that has python but not node) doesn't have those tools.

Fix: cache is now `WeakMap<LocalSpawn, ProbeCache>` keyed on the
effective spawn backend (whatever `getSpawn(config)` returns).
Disposed backends GC their entry; the test reset hook re-creates the
map. Mirrors the ripgrep cache exactly.

Test: `codex review fixes (round 3) › syntax-check probe cache is
backend-keyed › does not bleed an "rg/node/python available" verdict
from one backend to another`.

#8 P2 — grep pattern parsed as flag when dash-prefixed
------------------------------------------------------
The `grep_search` tool sent `input.pattern` as a positional arg to
`rg`. A pattern like `-foo` got parsed as an unknown flag and rg
bailed out instead of searching for it. `rg --help` explicitly
requires `-e/--regexp` (or `--`) for dash-prefixed patterns.

Fix: pass the pattern via `-e <pattern>`. Same trick also defends
against any future flag-conflict if a user query happens to look like
an rg long option. The Node fallback path doesn't need this — JS
RegExp accepts dash-prefixed patterns natively.

Test: `codex review fixes (round 3) › grep passes pattern via -e ›
handles dash-prefixed patterns without rg interpreting them as flags`.

Tests: 1264 passing (was 1261; +3 new).

* fix(local): two more P1 Codex nits — quoted destructive + symlink-escape

#9 P1 — dangerous-command gate missed quoted targets
----------------------------------------------------
`validateBashCommand` checks `dangerousCommandPatterns` against
`normalized` — the post-`stripQuotedContent` form. That pass blanks
the contents of every quoted span, so destructive targets that the
LLM (accidentally or not) wraps in quotes — `rm -rf "/"`,
`rm -rf "$HOME"`, `chmod -R 777 "/"` — slip past the regex even
though they execute identically to the bare form. Real safety
regression on the default-on guard.

Fix: split the destructive check into two passes.

  1. The existing `dangerousCommandPatterns` continue to run against
     `normalized`. Same false-positive defence (`echo "rm -rf /"`
     stays valid because the destructive text is wrapped by the
     OUTER quote pair, not by quotes around the path argument).
  2. New `quotedDestructivePatterns` run against the ORIGINAL
     command string. Each pattern explicitly REQUIRES a matching
     quote pair around the destructive path (`"/"`, `"$HOME"`,
     `'/'`, etc.). `echo "rm -rf /"` doesn't match — the `/` isn't
     surrounded by quotes there, only the whole `rm -rf /` string is.

Tests: `codex review fixes (round 4) › quoted destructive targets`
covers `"/"`, `"$HOME"`, `'/'`, `chmod -R 777 "/"`, no-regression
on bare forms, and the `echo "rm -rf /"` no-false-positive case.

#10 P1 — workspace policy hook compared lexical paths only
----------------------------------------------------------
`createWorkspacePolicyHook` did `isAbsolute(p) ? resolve(p) :
resolve(root, p)` and called it done. A symlink inside the workspace
pointing outside (e.g. `workspace/escape → /etc/secret`) lexically
appears under `workspace/` and got auto-allowed — even though the
agent reading through it touches a path the policy meant to gate.

Critical when the hook is the primary gate: the documented "ask the
user" setup turns `allowReadOutside: true` so the file tools' own
clamp is off, leaving the hook as the sole defence.

Fix: the hook now realpaths both the candidate and the workspace
roots before comparing. Roots are pre-realpath'd once at construction
(memoized via a Promise so the per-call cost is one realpath) and
candidate paths get the same `realpathOfPathOrAncestor` walk
`resolveWorkspacePathSafe` already uses — so paths that don't yet
exist (e.g. `write_file` to a brand-new path) still get checked
against their nearest existing ancestor's realpath.

Two test cases pin both halves:
  - `workspace/escape → extra/secret.txt` — hook DENIES read of
    `escape` even though it's lexically inside.
  - `extra/alt-mount → workspace/` — hook ALLOWS read of
    `extra/alt-mount/file.ts` because the realpath lands inside the
    workspace root (covers the alternate-mount case so we don't
    over-correct).

Tests: 1272 passing (was 1264; +8 across both fixes).

* fix(local): treat maxSpawnedBytes=0 as unlimited (Codex P2 #11)

The public LocalExecutionConfig contract documents `maxSpawnedBytes: 0`
as "no cap", but the kill check `totalSpawnedBytes > hardKillBytes`
triggered on the very first byte when the cap was zero. Skip the kill
path entirely when `hardKillBytes <= 0` so configs that explicitly opt
out of the OOM backstop behave as documented.

* chore(scripts): forward ON_RUN_STEP in pi-vs-ours harness

Without an ON_RUN_STEP handler the aggregator's stepMap is empty when
ON_RUN_STEP_COMPLETED arrives, so every tool call logged "No run step
or runId found for completed step event" to stderr. Mirror the pattern
used in src/scripts/code_exec.ts (and friends): register both events.
Harness-only — no SDK behavior change.

* fix(local): ESM-safe spill + surface rg failures in glob_search (Codex P1 #12, P2 #13)

Two issues from the latest review pass:

1) `spawnLocalProcess`'s spill path used `require('fs')` inside an
   ESM-shipped module. Fine in CJS test runs, throws
   `ReferenceError: require is not defined` for any ESM consumer that
   triggers the overflow path. Switch to a static
   `import { createWriteStream } from 'fs'` so both build outputs work.

2) `createLocalGlobSearchTool` ignored ripgrep's exit code and stderr,
   so `rg --files <missing-target>` (exit 2) silently mapped to "No
   files found." — the agent then treated a tooling failure as a real
   absence of matches. Now we check `result.exitCode > 1` (and
   `timedOut`) and return an explicit `glob_search failed: <stderr>`
   string + an artifact carrying the error.

Tests pinned: `codex review fixes (round 5) › spill path is ESM-safe`
and `… › glob_search surfaces ripgrep failures`. The glob test uses an
injected spawn backend so it runs deterministically on hosts without
ripgrep installed.

* fix(local): sandbox loopback bridge + additionalRoots writes (Codex P1 #14, P2 #15)

Two issues with how `buildSandboxRuntimeConfig` lays out the
SandboxRuntimeConfig:

1) The local programmatic-tool bridge listens on 127.0.0.1, but the
   sandbox default `allowedDomains: []` denies all outbound network —
   so `run_tools_with_code` / `run_tools_with_bash` fail under sandbox
   even though they work unsandboxed. Seed allowedDomains with
   `127.0.0.1`, `localhost`, `::1` (skip any host the user explicitly
   listed in `deniedDomains`, and dedupe against user-supplied
   `allowedDomains`).

2) The sandbox `allowWrite` allowlist was seeded with `cwd` only, so
   `workspace.additionalRoots` paths were resolvable by file tools but
   blocked for sandboxed shell/code. Now seed from `getWorkspaceRoots`
   so both surfaces share the same boundary.

Exported `buildSandboxRuntimeConfig` so the round-6 tests can drive it
directly without standing up the real native sandbox runtime. Test
count: 1283 passing (was 1276). Lint baseline unchanged.

* fix(direct-path): HITL edit reads updatedInput; PostToolUse updates registry (Codex P1 #16, P2 #17)

Two bugs in `runDirectToolWithLifecycleHooks`:

1) The `edit` decision branch read `decision.args`, but the documented
   `ToolApprovalDecision` shape uses `updatedInput`. Hosts following
   the public type signature were silently ignored — the tool ran
   with the original (un-edited) arguments. Now mirrors the
   event-driven path's validation: read `updatedInput`, fail closed
   on missing/wrong-shaped payloads (returns an error ToolMessage
   instead of executing with undefined args).

2) When a `PostToolUse` hook returns `updatedOutput`, the direct
   path returned a new ToolMessage with the replacement content but
   never updated the `toolOutputReferences` registry. Subsequent
   `{{tool<i>turn<n>}}` substitutions then delivered the stale
   pre-hook bytes while the model observed the post-hook
   replacement. Now reads `_refKey`/`_refScope` from the message
   metadata (already stamped by `recordOutputReference`) and calls
   `toolOutputRegistry.set` with the replacement.

Tests: `direct-path HITL: resume scope > edit decision (Codex P1 #16)`
covers both the happy path (edited args reach the body) and the
fail-closed branch. `ToolNode tool output references > PostToolUse
updatedOutput updates the registry (Codex P2 #17)` chains two calls
where the second references the first via {{tool0turn0}} and asserts
the substituted value matches the post-hook content.

* fix: snapshot direct-batch args + curl-first bash bridge (Codex P1 #18, P2 #19)

P1 #18 (ToolNode.ts): direct-path tools resolved {{tool…turn…}}
placeholders against the LIVE registry inside `runTool`, which runs
AFTER awaiting PreToolUse hooks. In a `Promise.all`-driven batch a
slow hook on call A could let call B finish first and register its
output, then A's late re-resolve would substitute B's output into A's
args — order-dependent leakage that violates same-turn isolation.
Snapshot the registry once per batch in `run()` (mirrors the
event-driven path), thread it through `RunToolBatchContext`, and have
both `runDirectToolWithLifecycleHooks` and `runTool` resolve against
the snapshot when present.

P2 #19 (LocalProgrammaticToolCalling.ts): the bash bridge required
`python3` on PATH, breaking `run_tools_with_bash` on minimal
containers and Windows hosts without python3 installed. Bridge now
serves a `?mode=text` variant that returns the already-serialized
result body so bash callers can use `curl` (universally available)
without pulling in a JSON parser. Bash helper tries curl first, falls
back to python3 for environments without curl, errors helpfully if
neither is available.

Tests pinned: `ToolNode tool output references › direct-batch
snapshot isolation (Codex P1 #18)` (slow PreToolUse hook + sibling
finishes-first scenario), and `ProgrammaticToolCalling › bash bridge
script does not require python3 (Codex P2 #19)` (asserts both helper
branches present in the generated script with curl preferred).

* fix(local): block `--` end-of-options + gate compile_check overrides (Codex P1 #20, P1 #21)

P1 #20: the destructive-command guard required the target path to
follow option flags directly, so `rm -rf -- "/"` (and similarly
`chmod -R 777 -- "/"`) bypassed the check via the GNU/BSD
end-of-options marker. Both `dangerousCommandPatterns` and
`quotedDestructivePatterns` now allow an optional `(?:--\\s+)?`
between the flags and the destructive target.

P1 #21: `compile_check` ran the resolved command through `bash -lc`
without going through `validateBashCommand`, so a host-supplied
`command` override (or `compileCheck.command` config entry) could
execute mutating/destructive commands even with `readOnly: true` or
the dangerous-command guard normally in force. Now routes through
`validateBashCommand(detection.command, config)` first; on failure
returns an explanatory ToolMessage with `ran: false` rather than
spawning the process.

Tests pinned: `codex review fixes (round 6) › destructive guard
handles `--` end-of-options` (5 cases including a benign-`--`
no-regression check) and `… › compile_check enforces
validateBashCommand + readOnly` (rm -rf, touch under readOnly,
benign echo). Lint baseline unchanged.

* chore: audit-pass cleanups + missing test coverage (manual review #1, #3, #5, #7, #9, #10, #11)

Comprehensive-review audit findings, fixed in one pass:

#1 (attachments full-file MIME sniff): switch classifyAttachment to
   open() + 12-byte header read for sniffing; only readFile() the
   full buffer when we're about to embed. A 9 MB image now allocates
   12 bytes for sniffing instead of 9 MB.

#3 (fallback walker SKIP_DIRS too small): expanded the Node-fallback
   walker's skip set to cover dist/build/.next/.cache/__pycache__/
   .venv/venv/target/vendor/coverage/.tox/.mypy_cache and friends.
   ripgrep itself respects .gitignore so this only affects the
   fallback.

#5 (dynamic fs/promises import): replaced two `await import('fs/
   promises').then(...)` calls in CompileCheckTool with a static
   `import { readFile, stat } from 'fs/promises'` per AGENTS.md.

#7 (test-only resets in public API): added @internal JSDoc to all
   three `_reset*ForTests` functions so IDE autocomplete signals
   they aren't part of the SDK surface. (The leading underscore was
   convention-only; @internal is the standard tag.)

#9 (legacy fields missing @deprecated): added @deprecated tags to
   `LocalExecutionConfig.cwd` and `allowOutsideWorkspace` pointing
   to their workspace.* replacements.

#10 (dead lexicallyInside compute in workspace policy hook): dropped
   the unused variable. Realpath is the source of truth for both
   the symlink-escape and alternate-mount cases; the lexical check
   provides no information so we don't pay for it.

#11 (voided unused imports in pi comparison script): removed the
   `void readdir; void cp;` workaround and the dead imports they
   were silencing.

Bonus: also addressed the comprehensive review's #2 (bitwise OR) by
verifying it was a misread — the code uses `||`, not `|`. Manual
review's reading was wrong; no fix needed.

Larger findings addressed:

#4 (no editStrategies tests): added
   `src/tools/local/__tests__/editStrategies.test.ts` with 12 cases
   covering exact / line-trimmed / whitespace-normalized /
   indentation-flexible match boundaries, start/end-of-file matches,
   unicode content, multi-line needles spanning blank lines,
   ambiguous-match rejection, and applyEdit semantics.

#12 (no FileCheckpointer tests): added
   `src/tools/local/__tests__/FileCheckpointer.test.ts` with 6
   cases covering capture+rewind happy path, idempotent capture,
   absent-file rewind (deletion), multi-file rewind, oversize-file
   tracking-but-not-snapshotting, and rewind across a deleted
   parent directory.

Performance:

#8 (lineWindow split-whole-file): replaced `content.split('\\n')`
   in `lineWindow` with a newline-walk that's O(start + limit)
   instead of O(file). For a 10 MB file with `offset: 1, limit: 10`
   this drops from "split the whole file into millions of strings"
   to ~10 indexOf calls. Falls back to the simple split when no
   limit is supplied.

Deferred:

#6 (spill files never cleaned) needs a Run-lifecycle hook to run
   cleanup; tracked separately rather than fixed inline.

* fix: comprehensive review (round 7) — bridge hooks, sandbox latch, nested-shell, regex DoS, attachments via WorkspaceFS, expose checkpointer

Codex P2 (sandbox warning latch): syntax preflight, ripgrep-availability,
and node/python/bash probes used to spawn through the public path that
fires `maybeWarnSandboxOff`, both emitting a misleading "sandbox is off"
warning when the run had `sandbox.enabled: true` AND latching
`sandboxOffWarned = true` so a later genuinely-unsandboxed execution
silently skipped its warning. Added an opt-in `{ internal: true }`
options arg to spawnLocalProcess that suppresses both. Threaded it
through every internal probe.

Manual #A (P1): the in-process programmatic-tool bridge invoked inner
tools via `executeTools` directly, bypassing every PreToolUse hook the
host registered. ToolNode now plumbs a `hookContext` (registry +
runId/threadId/agentId) into the programmatic-tool factory; the bridge
runs PreToolUse before each inner tool.invoke. `deny` and `ask` (HITL
not reachable from inside an HTTP handler) fail closed; `updatedInput`
is applied to the inner tool args. 4 tests pinned.

Manual #B (P1): already fixed in c6e2632 (round 9 — compile_check
routes through validateBashCommand). Reviewer was at 75a54b3, predates
that commit.

Manual #C (P1): `bash -lc "rm -rf $HOME"` (and friends) bypassed the
destructive-command guard because `stripQuotedContent` blanks the
nested payload before the patterns run. Added a third pattern list
`nestedShellDestructivePatterns` that runs against the original
command and matches destructive ops (`rm -rf`, `chmod -R 777`,
`chown -R`) inside `<shell> -[l]?c "..."` and `eval "..."` payloads.
4 tests including a benign-nested-echo no-regression check.

Manual #D (P2): `fallbackGrep` compiled the model-supplied pattern via
`new RegExp` and ran it across every file, opening the door to
catastrophic-backtracking DoS (`(a+)+$`-style patterns). Added a
guardrail layer: 1024-char pattern length cap, nested-quantifier
heuristic rejection, 5-second wall-clock budget for the overall
search. Surfaced as a structured `FallbackGrepError` -> tool result
with engine: 'node-fallback' + the kind tag. Hosts that need
bulletproof regex safety should install ripgrep (RE2-based, no
backtracking). 2 tests pinned.

Manual #E (P2): `local.fileCheckpointing: true` in RunConfig was a
silent no-op in the auto-bind path — `createLocalCodingTools()` made
a checkpointer that was immediately discarded. Now the auto-bind path
uses `createLocalCodingToolBundle()` when checkpointing is on,
returns the checkpointer through `ResolveLocalToolsResult`, and
ToolNode stashes it + exposes via `getFileCheckpointer()`. 2 tests
pinned.

Manual #F (P3): `classifyAttachment` used host `fs/promises` directly
instead of the configured `WorkspaceFS`. A custom/remote engine could
fail to embed valid attachments OR accidentally read a host path with
the same absolute name. Added optional `fs` arg to classifyAttachment
threaded from `read_file` through to its `WorkspaceFS`. Backward
compatible — defaults to host fs/promises when no fs is supplied.

Tests touched: 768 passing across all suites (was 750ish). Lint
baseline unchanged.

* chore: address audit-of-audit findings (4/4 valid)

Follow-up audit on commit ebf8b6a flagged 4 nits/minors. All valid:

F1 (NIT): the lineWindow perf fix introduced exactly the same
  `void <var>` pattern that finding #10 had me remove from
  createWorkspacePolicyHook. Embarrassing — dropped the dead
  `line` variable + `line++` + `void line`.

F2 (MINOR): the editStrategies test commit message claimed coverage
  of "ambiguous-match rejection" but no test actually exercised the
  multi-match → null path. Added the missing case for the exact
  strategy. Noted in a comment that the looser strategies in the
  chain may still resolve duplicates unambiguously by falling
  through; whether that's correct is a separate design call.

F3 (NIT): the new FileCheckpointer.test.ts had a `import('fs/
  promises')` inside a catch block when `mkdir` was already
  available via the static import at the top. Static-imported
  `mkdir` and restructured the test to mkdir-then-writeFile
  unconditionally (the catch-and-retry pattern was unnecessary
  defensive plumbing).

F4 (NIT): the unused-imports cleanup missed `sum`/`void sum` in
  the comparison harness. Removed the function and its void.

* fix: expose file checkpointer through Run/Graph (audit follow-up)

Round-7 finding E only fixed half the problem: ToolNode got
`getFileCheckpointer()` so direct `new ToolNode(...)` users could
reach it, but the normal `Run.create(...)` path constructed
ToolNodes inline inside `StandardGraph.initializeTools` and dropped
the reference. So `RunConfig.toolExecution.local.fileCheckpointing:
true` was still a no-op for Run callers — and the public JSDoc on
`fileCheckpointing` referenced a `Run.rewindFiles()` that didn't
exist.

Now:

- Graph adds `getOrCreateFileCheckpointer()` — lazily constructs ONE
  per-Run checkpointer when local fileCheckpointing is on. Cleared
  in `clearHeavyState()` along with the other per-Run state.

- `StandardGraph.initializeTools` threads it into BOTH ToolNode
  constructions (the agentContext branch and the traditional-tools
  branch). Multi-agent graphs end up sharing a single snapshot
  store, so `rewind()` reaches writes from any agent.

- ToolNode accepts `fileCheckpointer?` in its constructor and
  passes it to `resolveLocalExecutionTools`, which forwards into
  `createLocalCodingToolBundle({ checkpointer })`. Caller-provided
  wins over the auto-created one.

- Run gains `getFileCheckpointer()` and `rewindFiles()` — the
  documented APIs. `rewindFiles()` returns 0 when checkpointing is
  disabled (no-op rather than throwing).

- `LocalExecutionConfig.fileCheckpointing` JSDoc updated to
  reference the now-real APIs.

Tests pinned: `… › fileCheckpointer reachable through
Run.getFileChec…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant